In this chapter you will learn how to read data from files, do some analysis and write the results to disk.
Reading and writing files is an essential part of programming, as it is the first step for your program to communicate with the outside world. In most cases you will write programs that take data from some source, manipulate it in some way, and write some results out somewhere.
For example, if you ran a survey, you could take input from participants on a web server and save their answers in files or in a database. When the survey is over, you would read these results in, do some analysis on the data you have collected, maybe make some visualizations, and save your results.
In NLP, you often process files containing raw texts with some code and write the results to some other file.
os
and glob
We use some materials from this other Python course.
If you have any questions about this chapter, please refer to the forum on Canvas.
In Python, you can read the content of a file, store it as the type of object that you need (string, list, etc.) and manipulate it (e.g. replacing or removing words). You can also write new content to an existing or a new file.
Here, we will discuss how to:
To open a file, we need to associate the file on disk with a variable in Python. First, we tell Python where the file is stored on your disk. The location of your file is often referred to as the file path.
Python will start looking in the 'working' or 'current' directory (which will often be where your Python script is). If the file is in the working directory, you only have to tell Python the name of the file (e.g. charlie.txt). If it's not in the working directory, as in our case, you have to tell Python the exact path to your file. We will create a string variable to store this information:
In [ ]:
filename = "../Data/Charlie/charlie.txt"
# The double dots mean 'go up one level in the directory tree'.
Sometimes you see double dots in the beginning of the file path; this means 'the parent of the current directory'. When writing a file path, you can use the following:
Consider the directory tree below.
You will learn how to navigate your directory tree quite intuitively with a bit of practice. If you have any doubts, it is always a good idea to follow a quick tutorial on basic command line operations.
Navigating your directory tree on Windows
Also note that the formatting of file paths is different across operating systems. The file path as specified above should work on any UNIX platform (Linux, Mac). If you are using Windows, however, you might run into problems when formatting file paths in this way outside of this notebook, because Windows uses backslashes instead of forward slashes (Jupyter Notebook should already have taken care of these problems for you). In that case, it might be useful to have a look at this page about the differences between the file systems, and at this page about solving this problem in Python. In short, it's probably best if you use the code below (we will talk about the os module in more detail later today). This is very useful to know if you are a Windows user, and it will become relevant for the final assignment.
In [ ]:
# For windows:
import os
windows_file_path = os.path.normpath("C:/somePath/someFilename") # Use forward slashes
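As a small sketch of the same idea, os.path.join() builds a path from its parts using the separator of the operating system the code runs on. The folder and file names are the ones used in this chapter; the variable names are only illustrative.

```python
import os

# Build the path from its parts instead of typing the separators by hand.
# os.pardir is the string ".." (the parent directory).
filepath = os.path.join(os.pardir, "Data", "Charlie", "charlie.txt")
print(filepath)

# normpath() tidies up a path; on Windows it also turns forward slashes
# into backslashes.
normalized = os.path.normpath("../Data/Charlie/charlie.txt")
print(normalized)
```

Both variables should hold the same path, written in the style of your own operating system.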
We can use the file path to tell Python which file to open by using the built-in function open(). The open() function does not return the actual text that is saved in the text file. It returns a 'file object' from which we can read the content using the .read() method (more on this later). We pass three arguments to the open() function:
The most important mode arguments the open() function can take are:
Then, to open the file 'charlie.txt' for reading purposes, we use the following:
In [ ]:
filepath = "../Data/Charlie/charlie.txt"
infile = open(filepath, "r") # 'r' stands for READ mode
# Do something with the file
infile.close() # Close the file (you can ignore this for now)
Overview of possible mode arguments (the most important ones are 'r', 'w' and 'a'):
Character | Meaning |
---|---|
'r' | open for reading (default) |
'w' | open for writing, truncating the file first |
'x' | open for exclusive creation, failing if the file already exists |
'a' | open for writing, appending to the end of the file if it exists |
'b' | binary mode |
't' | text mode (default) |
'+' | open a disk file for updating (reading and writing) |
'U' | universal newlines mode (deprecated; removed in Python 3.11) |
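The following is a minimal sketch of how the 'x' and 'r' modes from the table behave. It works in a temporary folder so that no real files are touched; the file name modes_demo.txt is made up for this illustration, and the with-statement and try/except used here are only helpers for the demonstration.

```python
import os
import tempfile

# Work in a temporary folder that is removed again at the end.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    # 'x' creates the file, because it does not exist yet.
    outfile = open(path, "x")
    outfile.write("first line\n")
    outfile.close()

    # A second attempt with 'x' fails, because the file exists now.
    try:
        open(path, "x")
        second_x_failed = False
    except FileExistsError:
        second_x_failed = True

    # 'r' opens the existing file for reading.
    infile = open(path, "r")
    content = infile.read()
    infile.close()

print(second_x_failed)
print(repr(content))
```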
We could also pass the path directly to `open()`:
In [ ]:
infile = open("../Data/Charlie/charlie.txt" , "r")
infile.close()
So far, we have opened the file. This, however, does not yet show us the file content. Try printing 'infile':
In [ ]:
infile = open("../Data/Charlie/charlie.txt" , "r")
print(infile)
infile.close()
This TextIOWrapper object is Python's way of saying it has opened a connection to the file charlie.txt. To actually see its content, we need to tell Python to read the file.
Here, we will discuss three ways of reading the contents of a file:
read()
readlines()
readline()
The read() method is used to access the entire text in a file, which we can assign to a variable. Consider the code below.
The variable content now holds the entire content of the file charlie.txt as a single string and we can access and manipulate it just like any other string. When we are done with accessing the file, we use the close() method to close the file.
In [ ]:
# Opening the file using the filepath and the 'read' mode:
infile = open("../Data/Charlie/charlie.txt" , "r")
# Reading the file using the `read()` method and assigning it to the variable `content`
content = infile.read()
print(content)
print()
print('This function returns a', type(content))
# closing the file (more on this below)
infile.close()
The readlines() method allows you to access the content of a file as a list of lines. This means that it splits the text at the newline characters ('\n') for you; each line in the list keeps its newline character at the end:
In [ ]:
# Opening the file using the filepath and the 'read' mode:
infile = open("../Data/Charlie/charlie.txt" , "r")
# Reading the file using the `readlines()` method and assigning it to the variable `lines`
lines = infile.readlines()
print(lines)
print()
print('This function returns a', type(lines))
# closing the file
infile.close()
Now you can, for example, use a for-loop to print each line in the file (note that the second line is just a newline character):
In [ ]:
for line in lines:
    print("LINE:", line)
Important note
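When we open a file, the file object keeps track of how far we have read. Once a read operation has reached the end of the file, a further read operation on the same file object returns nothing. If we want to read the content again, we have to open the file again. Consider the code below: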
In [ ]:
infile = open("../Data/Charlie/charlie.txt" , "r")
content = infile.read()
lines = infile.readlines()
print(content)
print(lines)
infile.close()
The second read operation, readlines(), returns an empty list, because read() has already consumed the whole file. To fix this, we have to open the file again:
In [ ]:
filepath = "../Data/Charlie/charlie.txt"
infile = open(filepath , "r")
content = infile.read()
infile.close()  # close the first file object before opening the file again
infile = open(filepath, "r")
lines = infile.readlines()
print(content)
print(lines)
infile.close()
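The behaviour above comes from the fact that a file object remembers its position in the file. Below is a minimal sketch of this, using a temporary file with made-up content instead of charlie.txt. The seek(0) method moves the position back to the start, which is an alternative to opening the file again.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name
    outfile = open(path, "w")
    outfile.write("line one\nline two\n")
    outfile.close()

    infile = open(path, "r")
    content = infile.read()        # reads everything; the position is now at the end
    leftover = infile.readlines()  # nothing is left to read, so this is an empty list
    infile.seek(0)                 # move the position back to the start of the file
    lines = infile.readlines()     # now all lines are read again
    infile.close()

print(leftover)
print(lines)
```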
The third operation, readline(), returns the next line of the file: the text up to and including the next newline character (\n; when reading in text mode, Python translates the Windows line ending \r\n to \n by default). More simply put, this operation reads a file line by line. So if you call it again, it will return the next line in the file. Try it out below!
In [ ]:
filepath = "../Data/Charlie/charlie.txt"
infile = open(filepath, "r")
next_line = infile.readline()
print(next_line)
In [ ]:
next_line = infile.readline()
print(next_line)
In [ ]:
next_line = infile.readline()
print(next_line)
infile.close()
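Once the end of the file is reached, readline() returns an empty string. An empty line inside the file is still returned as '\n', so the two cases can be told apart. The sketch below uses this to read a whole file with readline(); the temporary file and its content are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name
    outfile = open(path, "w")
    outfile.write("first\n\nthird\n")  # the second line is empty
    outfile.close()

    infile = open(path, "r")
    collected = []
    line = infile.readline()
    while line != "":              # an empty string means: end of file
        collected.append(line)
        line = infile.readline()
    infile.close()

print(collected)
```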
Which function to choose
For small files that you want to load entirely, you can use one of these three methods (readline, read, or readlines). Note, however, that we can also simply do the following to read a file line by line (this is recommended for larger files and when we are really only interested in a small portion of the file):
In [ ]:
infile = open(filename, "r")
for line in infile:
    print(line)
infile.close()
Note the last line of this code snippet: infile.close(). This closes our file, which is a very important operation: it prevents Python from keeping files open that are no longer needed. In the next subchapter we will also see a more convenient way to ensure files get closed after use.
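When looping over a file, every line still ends with its newline character, so print(line) shows an extra blank line after each line. A minimal sketch of removing it with rstrip() is shown here. To keep the example self-contained, io.StringIO stands in for an opened text file, and the text itself is made up.

```python
import io

# io.StringIO behaves like an opened text file; it stands in for open(filename, "r").
infile = io.StringIO("first line\nsecond line\n")

stripped_lines = []
for line in infile:
    # Each line still ends with "\n"; rstrip("\n") removes it.
    stripped_lines.append(line.rstrip("\n"))
infile.close()

print(stripped_lines)
```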
Here, we will introduce closing a file with the method close() and using a context manager to open and close files. After reading the contents of a file, the TextIOWrapper no longer needs to be open, since we have stored the content in a variable. In fact, it is good practice to close the file as soon as you do not need it anymore.
We do this by using the close() method, as already shown several times above.
In [ ]:
filepath = "../Data/Charlie/charlie.txt"
# open file
infile = open(filepath , "r")
# assign content to a variable
content = infile.read()
# close file
infile.close()
# do whatever you want with the content, e.g. print it:
print(content)
There is actually an easier (and preferred) way to make sure that the file is closed as soon as you don't need it anymore, namely using what is called a context manager. Instead of using open() and close(), we use the syntax shown below.
The main advantage of using the with-statement is that it automatically closes the file once you leave the local context defined by the indentation level. If you 'manually' open and close the file, you risk forgetting to close the file. Therefore, context managers are considered a best practice, and we will use the with-statement in all of our following code.
In [ ]:
filepath = "../Data/Charlie/charlie.txt"
with open(filepath, "r") as infile:
    content = infile.read()
print(content)
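To check that the with-statement really closes the file, you can look at the closed attribute of the file object. The sketch below does this with a temporary file; the file name and its content are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name
    with open(path, "w") as outfile:
        outfile.write("some text\n")

    with open(path, "r") as infile:
        content = infile.read()
        closed_inside = infile.closed   # still open inside the with-block
    closed_outside = infile.closed      # closed as soon as the block ends

print(closed_inside)
print(closed_outside)
```

The content stays available in the variable content after the file has been closed.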
Once your file content is loaded in a Python variable, you can manipulate its content as you can manipulate any other variable. You can edit it, add/remove lines, count word occurrences, etc. Let's say we read the file content into a list of its lines as shown below. Note that we can use all of the different methods for reading files in the context manager.
In [ ]:
filepath = "../Data/Charlie/charlie.txt"
with open(filepath, "r") as infile:
    lines = infile.readlines()
print(lines)
Then we can for instance preserve only the first 2 lines of the file, in a new variable:
In [ ]:
first_two_lines=lines[:2]
first_two_lines
We can count the lines that are longer than 15 characters:
In [ ]:
counter = 0
for line in lines:
    if len(line) > 15:
        counter += 1
print(counter)
In the next chapter, we will see how to process the text with an external module once the file is loaded. But let's first write our modified file back to disk to preserve the changes.
To write content to a file, we can open a new file and write the text to this file by using the write() method. Again, we can do this by using the context manager. Remember that we have to specify the mode using w.
Let's first slightly adapt our Charlie story by replacing the names in the text:
In [ ]:
filepath = "../Data/Charlie/charlie.txt"
# read in file and assign content to the variable content
with open(filepath, "r") as infile:
    content = infile.read()
# manipulate content
your_name = "x y" #type in your name
friends_name = "a b" #type in the name of a friend
# Replace all instances of Charlie Bucket with your name and save it in new_content
new_content = content.replace("Charlie Bucket", your_name)
# Replace all instances of Mr Wonka with your friend's name and save it in new_new_content
new_new_content = new_content.replace("Mr Wonka", friends_name)
We can now save the manipulated content to a new file:
In [ ]:
filename = "../Data/Charlie/charlie_new.txt"
with open(filename, "w") as outfile:
    outfile.write(new_new_content)
Open the file charlie_new.txt in the folder ../Data/Charlie in any text editor and read a personalized version of the story!
Note about append mode (a):
The third mode of opening a file is append ('a'). If the file 'charlie_new.txt' does not exist, then append and write act the same: they create the new file and fill it with content. The difference between write and append shows when the file already exists. In that case, write mode overwrites its content, while append mode adds the new content at the end of the existing content.
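The difference between the two modes can be sketched as follows. The example works in a temporary folder, and the file name and the text written to it are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes.txt")  # hypothetical file name

    # The file does not exist yet, so 'w' creates it.
    with open(path, "w") as outfile:
        outfile.write("one\n")

    # 'a' keeps the existing content and adds to the end.
    with open(path, "a") as outfile:
        outfile.write("two\n")
    with open(path, "r") as infile:
        after_append = infile.read()

    # 'w' throws the existing content away before writing.
    with open(path, "w") as outfile:
        outfile.write("three\n")
    with open(path, "r") as infile:
        after_write = infile.read()

print(repr(after_append))
print(repr(after_write))
```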
You will often have multiple files to work with. The folder ../Data/Dreams contains 10 text files describing dreams of Vickie, a 10-year-old girl. These texts are extracted from DreamBank.
To process multiple files, we often want to iterate over a list of files. These files are usually stored in one or multiple directories on your computer.
Instead of writing out every single file path, it is much more convenient to iterate over all the files in the directory ../Data/Dreams. So we need to find a way to tell Python: "I want to do something with all these files at this location!"
There are two modules which make dealing with multiple files a lot easier.
We will introduce them below.
The glob module is very useful for finding all the pathnames matching a specified pattern according to the rules used by the Unix shell. You can use two wildcards: the asterisk (*) and the question mark (?). An asterisk matches zero or more characters in a segment of a name, while the question mark matches a single character in a segment of a name.
For example, the following code gives all filenames in the directory ../Data/Dreams:
In [ ]:
import glob
In [ ]:
for filename in glob.glob("../Data/Dreams/*"):
    print(filename)
If we only want to consider text files and ignore everything else (here a file called 'IGNORE_ME!'), we can specify this in our search by only looking for files with the extension .txt:
In [ ]:
for filename in glob.glob("../Data/Dreams/*.txt"):
    print(filename)
A question mark (?) matches any single character in that position in the name. For example, the following code prints all filenames in the directory ../Data/Dreams that start with 'vickie' followed by exactly 1 character and end with the extension .txt (note that this will not print vickie10.txt):
In [ ]:
for filename in glob.glob("../Data/Dreams/vickie?.txt"):
    print(filename)
You can also find filenames recursively by using the pattern ** (the keyword argument recursive should be set to True), which will match any files and zero or more directories and subdirectories. The following code prints all files with the extension .txt in the directory ../Data/ and in all its subdirectories:
In [ ]:
for filename in glob.glob("../Data/**/*.txt", recursive=True):
    print(filename)
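If you want to try the wildcards without the Dreams folder, the sketch below builds a small temporary folder with made-up file names modelled on the ones above. The results are sorted, because glob does not promise any particular order, and glob.escape() is used so that special characters in the temporary folder name are not treated as wildcards.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty files to search through (hypothetical names).
    for name in ["vickie1.txt", "vickie2.txt", "vickie10.txt", "IGNORE_ME!"]:
        with open(os.path.join(tmp_dir, name), "w") as outfile:
            outfile.write("")

    folder = glob.escape(tmp_dir)

    # '*' matches any number of characters: all .txt files.
    all_txt = sorted(os.path.basename(path)
                     for path in glob.glob(os.path.join(folder, "*.txt")))

    # '?' matches exactly one character: vickie10.txt is left out.
    one_char = sorted(os.path.basename(path)
                      for path in glob.glob(os.path.join(folder, "vickie?.txt")))

print(all_txt)
print(one_char)
```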
Another module that you will frequently see being used in examples is the os module. The os module has many features that can be very useful and which are not supported by the glob module. We will not go over each and every useful method here, but here's a list of some of the things that you can do (some of which we have seen above):
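os.mkdir(), os.makedirs(): create a single directory, or a directory together with any missing parent directories;
os.rmdir(), os.removedirs(): remove a single empty directory, or a chain of empty directories;
os.path.isfile(), os.path.isdir(): check whether a path is an existing file or an existing directory;
os.path.split(): split a path into a tuple of directory and filename;
os.path.join(): build a path out of one or more parts;
os.path.splitext(): split a path into a tuple of everything before the extension and the extension itself;
os.path.basename(), os.path.dirname(): get only the filename or only the directory part of a path.
Feel free to play around with these functions and figure out how they work yourself :-)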
In [ ]:
# Start by importing the module:
import os
# let's use a filepath for testing it out:
filepath = "../Data/Charlie/charlie.txt"
os.path.basename(filepath)
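A few more of these functions can be tried out in the same way. The sketch below only works with the path as a string, so it runs even if the file does not exist on your disk; the variable names are only illustrative.

```python
import os

filepath = "../Data/Charlie/charlie.txt"

# Split the path into the directory part and the filename.
directory, filename = os.path.split(filepath)
print(directory)
print(filename)

# Split the filename into its name and its extension.
name, extension = os.path.splitext(filename)
print(name)
print(extension)

# Build a new path in the same directory.
new_path = os.path.join(directory, name + "_new" + extension)
print(new_path)
```

The separator that os.path.join() inserts depends on your operating system, so the last line can look slightly different on Windows.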
Exercise 1:
Write a program that opens RedCircle.txt in the ../Data/RedCircle folder and prints its content as a single string:
In [ ]:
# your code here
Exercise 2:
Write a program that opens RedCircle.txt in the ../Data/RedCircle folder and prints a list containing all lines in the file:
In [ ]:
# your code here
Exercise 3:
Create a counter dictionary like in block 2 (the dictionaries chapter), where you will count the number of occurrences of each word in a file.
In [ ]:
# your code here
Exercise 4:
The module os implements functions that allow us to work with the operating system (see folder contents, change directory, etc.). Use the function listdir from the module os to see the contents of the current directory. Then print all the items that do not start with a dot.
In [ ]:
# your code here